Microfilm, Paper, and OCR: Issues in Newspaper Digitization

Authors

  • John Herbert
  • Kenning Arlitsch
Abstract

By Kenning Arlitsch and John Herbert. Kenning Arlitsch and John Herbert are both at the J. Willard Marriott Library, University of Utah. Mr. Arlitsch (kenning.arlitsch@library.utah.edu; 295 S. 1500 East, Room 463, Salt Lake City, UT 84112) is Head of Information Technology, and Mr. Herbert (John. [email protected]; 295 S. 1500 East, Room 418, Salt Lake City, UT 84112) is Program Director, Utah Digital Newspapers. They gratefully acknowledge the contributions of Scott Christensen and Frederick Zarndt of iArchives Inc., and of Randy Silverman, Preservation Librarian at the Marriott Library, in the preparation of this manuscript.

Similar Resources

Full-Text Access to Historical Newspapers Tapas Kanungo and

Newspapers are rich records of U.S. history. Due to the deterioration of older newspapers, the National Endowment for the Humanities is archiving 19th century newspapers on microfilm. Although microfilm is a good preservation method, it provides limited access to researchers and the general public. We are building a system to provide universal access to digital images and full-text content of h...

Automatic Indexing of Newspaper Microfilm Images

This paper describes a proposed document analysis system that aims at automatic indexing of digitized images of old newspaper microfilms. This is done by extracting news headlines from microfilm images. The headlines are then converted to machine readable text by OCR to serve as indices to the respective news articles. A major challenge to us is the poor image quality of the microfilm as most i...

Google Newspaper Search - Image Processing and Analysis Pipeline

The Google Newspaper Search program was launched on September 8, 2008[1]. In this paper, we outline the technology pieces underlying this large and complex project. We have created a production pipeline which takes newspaper microfilms as input and emits individual news articles as output. These articles are then indexed and added to the content base, so that they turn up in response to Google ...

Non-interactive OCR Post-correction for Giga-Scale Digitization Projects

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up or ticcl (pronounce ’tickle’) focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within the predefined Leven...

Page 10

Online content-searchable databases of music scores, unlike text databases, are extremely rare. The main reasons are the cost of digitization, the inaccessibility of original music scores and manuscripts, and the lack of sophisticated music recognition software. The proposed research seeks to circumvent these difficulties by investigating the feasibility of using existing microfilms for digitiz...

Publication date: 2004